2022 Medium Articles Analysis Scraped with Python

Scraped and analyzed 6432 articles published by Towards Data Science in 2022.

Introduction

When I started publishing articles regularly, I had many questions in mind. I read many articles, yet none of them fully satisfied me, because each one answered the questions its own author had in mind. So over the last year I researched how to do this analysis on my own. I had many other things to do, so I postponed it. In the meantime, I created a Medium scraper as a Jupyter notebook, and before 2022 ends, I want to tie up the loose ends.

That’s why I scraped a large amount of data from Medium, starting from 2014. So far, I have managed to clean the 2022 articles, which cover 6605 articles.

This contains all articles published in TDS in 2022. I recently added the data set to Kaggle, and you can find it here. Feel free to visit, create a notebook, analyze the data set, and publish your notebook.

In this article, I try to answer the questions that came to my mind when I started writing on Medium.

  • What is the number of articles per reading time published in TDS in 2022?
  • Which day is the best day to publish? Should I publish on weekdays or weekends?
  • Who are the top 15 writers in TDS who published the most articles in 2022?
  • Who are the top 10 writers in TDS whose articles are liked most per article?
  • What is the average number of likes per season? In which season should I publish my article series?
  • What is the average number of likes per month? Which 5 articles were liked the most?

At the end of the article, I also run a Z test in Python to answer the following questions:

  • Does an article have more likes if its title contains “Data”?
  • Does an article have more likes if its title contains “Machine Learning”?
  • Does an article have more likes if its title contains “Python”?

Now, let’s start analyzing by answering questions.

What is the number of articles per reading time, that have been published in TDS in 2022?

In this graph, you can see the number of articles per reading time published in Towards Data Science in 2022. It illustrates the distribution of articles across different reading times.

Image by Author
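A distribution like this could be computed with a simple count per reading time. The following is only a minimal sketch on toy data: the column name `reading_time` is an assumption and may differ from the one in the original notebook.

```python
import pandas as pd

# Toy stand-in for the scraped data set. `reading_time` (in minutes) is a
# hypothetical column name, not taken from the original notebook.
df = pd.DataFrame({"reading_time": [5, 7, 5, 8, 7, 5, 12]})

# Number of articles for each reading time, ordered by reading time.
articles_per_reading_time = df["reading_time"].value_counts().sort_index()
print(articles_per_reading_time)
```

The resulting Series has one row per reading time, which is the shape a bar chart needs.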

Which day is the best day to publish?

Here you can see that the best day to publish can be determined by analyzing average likes. Apparently, Friday is the best day for publishing an article, yet there is no drastic difference between the days. I once assumed that I would get fewer likes on weekends, yet this graph shows that my assumption was wrong.

Image by Author
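The average likes per day of the week could be computed along these lines. This is a sketch on toy data: the `date` column name is an assumption, while `like` follows the column used in the Z test code later in this article.

```python
import pandas as pd

# Toy data. `date` is a hypothetical column name; `like` matches the
# column used in the Z test section of this article.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-03", "2022-01-07", "2022-01-08", "2022-01-14"]),
    "like": [100, 150, 120, 170],
})

# Derive the weekday name from the publication date, then average the likes.
df["day"] = df["date"].dt.day_name()
avg_likes_per_day = df.groupby("day")["like"].mean().sort_values(ascending=False)
print(avg_likes_per_day)
```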

Should I publish on weekdays or weekends?

To determine whether you should publish on weekdays or weekends, you can compare the average likes of articles published on weekdays with those published on weekends. As the previous question already suggested, there is no significant difference.

Image by Author
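One way to split the articles into weekday and weekend groups is sketched below. The `date` column name and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data. `date` is a hypothetical column name.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-03", "2022-01-07", "2022-01-08", "2022-01-09"]),
    "like": [100, 140, 120, 130],
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
is_weekend = df["date"].dt.dayofweek >= 5
df["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

avg_likes_by_day_type = df.groupby("day_type")["like"].mean()
print(avg_likes_by_day_type)
```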

Who are the top 15 writers in TDS, that published the most articles in 2022?

Here we can see the top 15 writers who published the most articles in 2022, ranked by the number of articles each of them published.

Image by Author
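A ranking like this boils down to counting rows per writer. A minimal sketch, assuming a hypothetical `author` column and toy names:

```python
import pandas as pd

# Toy data. `author` is a hypothetical column name.
df = pd.DataFrame({"author": ["Ann", "Bob", "Ann", "Cem", "Ann", "Bob"]})

# value_counts sorts by count in descending order, so head(15) keeps the
# 15 most prolific writers (only 3 exist in this toy example).
top_writers = df["author"].value_counts().head(15)
print(top_writers)
```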

Let’s find out the most successful writers.

Who are the top 10 writers in TDS, whose articles are liked most per article?

Here you can see the top 10 writers in TDS whose articles are liked most per article. This is determined by taking the number of likes for each article and then calculating the average number of likes per article for each writer.

However, to make the comparison fairer, I applied one restriction: I selected only the writers who published at least 5 articles in 2022.

Image by Author
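The ranking with the minimum-article restriction could look roughly like this. The `author` column name and the toy data are assumptions; only the threshold of 5 articles and the top 10 cut-off come from the text above.

```python
import pandas as pd

# Toy data. `author` is a hypothetical column name.
df = pd.DataFrame({
    "author": ["A"] * 5 + ["B"] * 2 + ["C"] * 5,
    "like": [10, 20, 30, 40, 50, 500, 700, 100, 100, 100, 100, 100],
})

# Article count and average likes per writer.
stats = df.groupby("author")["like"].agg(articles="count", avg_likes="mean")

# Keep writers with at least 5 articles, then rank by average likes.
top_by_avg_likes = (
    stats[stats["articles"] >= 5]
    .sort_values("avg_likes", ascending=False)
    .head(10)
)
print(top_by_avg_likes)
```

Writer "B" has the highest average in this toy example but is dropped because of the restriction, which is the point of the filter: a single viral article should not dominate the ranking.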

What is the average per season? In which season should I publish my article series?

The average per season is determined from the number of likes received by articles published in each season (Spring, Summer, Fall, Winter).

This bar graph shows the average number of likes per article in each season, so you can see which season has the highest average.

If you plan to publish a series of articles, it looks like summer is the best season to start.

Image by Author
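Grouping by season first requires mapping each publication date to a season. The sketch below assumes meteorological northern-hemisphere seasons (December to February as winter, and so on) and a hypothetical `date` column; the original notebook may define seasons differently.

```python
import pandas as pd


def month_to_season(month: int) -> str:
    """Map a month number (1-12) to a season.

    Assumes meteorological northern-hemisphere seasons.
    """
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


# Toy data. `date` is a hypothetical column name.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-01-15", "2022-04-10", "2022-07-01", "2022-08-20", "2022-10-05"]
    ),
    "like": [90, 110, 150, 170, 120],
})

df["season"] = df["date"].dt.month.map(month_to_season)
avg_likes_per_season = df.groupby("season")["like"].mean()
print(avg_likes_per_season)
```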

What is the average per month?

Here you can see the average number of likes per article by month. December is the worst month for publishing articles in TDS, while August is the best. This matches the previous graph, where summer was the best season for likes.

Image by Author

Now let’s see the same graph in calendar order, starting from January.

Image by Author
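The two monthly views differ only in how the same averages are sorted. A minimal sketch on toy data, assuming a hypothetical `date` column:

```python
import pandas as pd

# Toy data. `date` is a hypothetical column name.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-08-02", "2022-01-10", "2022-12-24", "2022-08-30"]),
    "like": [180, 120, 80, 160],
})

# Average likes per month number (1 = January, ..., 12 = December).
avg_by_month = df.groupby(df["date"].dt.month)["like"].mean()

ranked = avg_by_month.sort_values(ascending=False)  # best month first
calendar_order = avg_by_month.sort_index()          # January first
print(ranked)
print(calendar_order)
```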

What are the top 5 most-liked articles?

The top 5 most-liked articles are found by sorting the articles by the number of likes each one received.

Image by Author
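Selecting the most-liked articles can be done in a single call. The sketch below uses toy titles; the `title` and `like` columns follow the names used in the Z test code later in this article.

```python
import pandas as pd

# Toy data with placeholder titles.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G"],
    "like": [10, 70, 30, 60, 20, 50, 40],
})

# nlargest returns the rows with the highest values, already sorted.
top5 = df.nlargest(5, "like")[["title", "like"]]
print(top5)
```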

WordCloud

A word cloud is a graphic representation of the most frequently used words in a given text or set of texts.

It typically shows words in different sizes and font weights, with the most frequently used words appearing in larger font sizes, and less frequently used words appearing in smaller font sizes.

Word clouds can be created using various text analysis techniques, such as counting the frequency of words or using natural language processing techniques. 

They are often used to quickly identify the most important themes or topics in a text, as well as to explore the relationships between different words.

Now let’s look at a word cloud built from the article titles to find the most common keywords.

Image by Author
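The frequency counting behind a word cloud can be sketched with the standard library alone. This is not the rendering step and not necessarily how the image above was produced: the titles and the small stop-word list are made up for illustration.

```python
import re
from collections import Counter

# Made-up titles and a tiny, hypothetical stop-word list.
titles = [
    "A Guide to Data Cleaning in Python",
    "Machine Learning with Python",
    "Data Visualization for Data Science",
]
stop_words = {"a", "to", "in", "with", "for", "the", "of", "and"}

# Lower-case each title, split it into words, and drop the stop words.
words = []
for title in titles:
    for word in re.findall(r"[a-z]+", title.lower()):
        if word not in stop_words:
            words.append(word)

word_counts = Counter(words)
print(word_counts.most_common(3))
```

Counts like these are what determine the font size of each word in the cloud.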

Z Test

So far, we have analyzed the data by looking at graphs. Now let’s test a few hypotheses.

Choosing the right topic is vital to the success of a blog post. That’s why, in this section, I try to answer three questions:

  • Does an article have more likes if its title contains “Data”?
  • Does an article have more likes if its title contains “Machine Learning”?
  • Does an article have more likes if its title contains “Python”?

To answer these questions, I will run a hypothesis test with the Z statistic for each keyword.

Does the article have more likes if the title contains “Data”?

The assumption is that titles with the “Data” keyword get more likes. The null hypothesis says this assumption is not valid, so there is no relationship between likes and the presence of the “Data” keyword in the title.

Alright, let’s start.

Here are the null and alternative hypotheses:

Ho: The articles whose titles contain the "Data" keyword do not have more likes than the others.
Ha: The articles whose titles contain the "Data" keyword have more likes than the others.

import numpy as np
from scipy.stats import norm

# df2 is assumed to be the cleaned DataFrame of 2022 TDS articles.
# str.contains is case-sensitive by default: this matches "Data", not "data".
df_d = df2[df2['title'].str.contains('Data')]
n = df_d.shape[0]  # number of articles with the keyword
df_not_d = df2[~df2['title'].str.contains('Data')]
m = df_not_d.shape[0]  # number of articles without the keyword
x = df_d["like"].values.mean()
y = df_not_d["like"].values.mean()
print("Average like per article which contains Data word is : {}".format(x))
print("Average like per article which does not contain Data word is : {}".format(y))

Output:
Average like per article which contains Data word is : 145.27632461435277
Average like per article which does not contain Data word is : 126.16352964986845

# NumPy's var() returns the population variance (ddof=0) by default.
x_var = df_d["like"].values.var()
y_var = df_not_d["like"].values.var()
print("Variance of like per article which contains Data word is : {}".format(x_var))
print("Variance of like per article which does not contain Data word is : {}".format(y_var))

Output:
Variance of like per article which contains Data word is : 34623.71036502944
Variance of like per article which does not contain Data word is : 35591.299305412445

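Calculating Z Score

# Difference of the two group means divided by its standard error
z = (x - y)/np.sqrt(x_var/n + y_var/m)
z
Output : 3.4650416548218073

Calculating P Values

# Right-tail probability of the standard normal distribution at z
p = 1 - norm.cdf(z)
p
Output : 0.00026507467906666804

The p-value is very small, well below the usual 0.05 threshold.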
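Since the same calculation is repeated for every keyword, it could be wrapped in a small helper. The sketch below is an assumption-laden illustration rather than the code used for the results in this article: the function name `one_sided_z_test` and the toy like counts are hypothetical, and it uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`.

```python
from math import sqrt
from statistics import NormalDist, fmean, pvariance


def one_sided_z_test(group_a, group_b):
    """Return (z, p) for the right-tailed test that group_a's mean is larger.

    Hypothetical helper mirroring the steps above: population variances
    (like NumPy's default var) and p = 1 - cdf(z).
    """
    n, m = len(group_a), len(group_b)
    mean_a, mean_b = fmean(group_a), fmean(group_b)
    var_a, var_b = pvariance(group_a), pvariance(group_b)
    z = (mean_a - mean_b) / sqrt(var_a / n + var_b / m)
    p = 1 - NormalDist().cdf(z)
    return z, p


# Toy like counts, not taken from the data set.
with_keyword = [150, 160, 140, 155, 145]
without_keyword = [120, 130, 125, 135, 115, 128]

z, p = one_sided_z_test(with_keyword, without_keyword)
print(z, p)
```

With real data, the two lists would be the `like` values of the articles with and without the keyword. Note that a normal approximation like this is only reasonable for large groups, which is the case for the thousands of articles analyzed here but not for the toy lists.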

What is Z Score?

The z-score tells us how many standard errors separate the mean likes of the articles that contain the keyword “Data” (x) from the mean likes of the articles that do not (y).

A large positive z-score indicates that the first group’s mean is far above the second group’s mean, which suggests a significant difference between the two groups.

The p-value is then calculated by subtracting the cumulative distribution function (cdf) of the standard normal distribution, evaluated at z, from 1.

What is the P-Value?

The p-value is the probability of observing a difference at least as large as the one in the sample if the null hypothesis were true. A small p-value (typically less than 0.05) indicates strong evidence against the null hypothesis, meaning that there is likely a significant difference between the two groups.

The output shows that the calculated z-score is about 3.47 and the p-value is about 0.00027.

These values suggest that articles with the “Data” keyword in the title receive significantly more likes than those without it.

With such a small p-value, a difference this large would be very unlikely if the keyword made no difference.

Takeaway

In this data set, articles whose titles contain “Data” received significantly more likes on average.

Does the article have more likes if the title of the article contains “Machine Learning”?

Ho: The articles whose titles contain the "Machine Learning" keyword do not have more likes than the others.
Ha: The articles whose titles contain the "Machine Learning" keyword have more likes than the others.

# Same steps as before, now for the "Machine Learning" keyword (case-sensitive match)
df_ml = df2[df2['title'].str.contains('Machine Learning')]
n = df_ml.shape[0]
df_not_ml = df2[~df2['title'].str.contains('Machine Learning')]
m = df_not_ml.shape[0]
x = df_ml["like"].values.mean()
y = df_not_ml["like"].values.mean()
print("Average like per article which contains Machine Learning word is : {}".format(x))
print("Average like per article which does not contain Machine Learning word is : {}".format(y))

Output:
Average like per article which contains Machine Learning word is : 126.07432432432432
Average like per article which does not contain Machine Learning word is : 130.8120925684485

x_var = df_ml["like"].values.var()
y_var = df_not_ml["like"].values.var()
print("Variance of like per article which contains Machine Learning word is : {}".format(x_var))
print("Variance of like per article which does not contain Machine Learning word is : {}".format(y_var))

Output:
Variance of like per article which contains Machine Learning word is : 20565.70393535427
Variance of like per article which does not contain Machine Learning word is : 36148.17117710747

z = (x - y)/np.sqrt(x_var/n + y_var/m)
z
Output : -0.5457262785917103

p = 1 - norm.cdf(z)
p
Output : 0.7073729473003265

The p-value of about 0.71 is far above 0.05, so we cannot reject the null hypothesis: titles that contain “Machine Learning” do not get significantly more likes.

Does the article have more likes if the title of the article contains “Python”?

Ho: The articles whose titles contain the "Python" keyword do not have more likes than the others.
Ha: The articles whose titles contain the "Python" keyword have more likes than the others.

# Same steps as before, now for the "Python" keyword (case-sensitive match)
df_python = df2[df2['title'].str.contains('Python')]
n = df_python.shape[0]
df_not_python = df2[~df2['title'].str.contains('Python')]
m = df_not_python.shape[0]
x = df_python["like"].values.mean()
y = df_not_python["like"].values.mean()
print("Average like per article which contains python word is : {}".format(x))
print("Average like per article which does not contain python word is : {}".format(y))

Output:
Average like per article which contains python word is : 156.37653631284917
Average like per article which does not contain python word is : 126.42658479320932

x_var = df_python["like"].values.var()
y_var = df_not_python["like"].values.var()
print("Variance of like per article which contains python word is : {}".format(x_var))
print("Variance of like per article which does not contain python word is : {}".format(y_var))

Output:
Variance of like per article which contains python word is : 39885.99341593583
Variance of like per article which does not contain python word is : 34587.302945045776

z = (x - y)/np.sqrt(x_var/n + y_var/m)
z
Output : 4.201586041664072

p = 1 - norm.cdf(z)
p
Output : 1.3252573576538751e-05

It looks like titles that contain “Python” get more likes, just like titles that contain “Data”.
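One detail worth keeping in mind: `str.contains('Data')` is case-sensitive, so a title with a lower-case “data” falls into the "does not contain" group. If a case-insensitive match is preferred, a variation like the following could be used. This is a sketch on made-up titles, not the filter used for the results above.

```python
import pandas as pd

# Made-up titles for illustration.
df = pd.DataFrame({"title": ["Data Cleaning Tips", "Big data in 2022", "Python Tricks"]})

# Default behaviour: case-sensitive, so "Big data in 2022" is not matched.
case_sensitive = df["title"].str.contains("Data")

# case=False ignores case; regex=False treats the keyword as plain text.
case_insensitive = df["title"].str.contains("data", case=False, regex=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```

Switching the filter would change the group sizes and means, so the z-scores and p-values reported above would need to be recomputed.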

Conclusion

In this article, I answered a wide range of questions about getting more likes on Medium, including the distribution of reading times and the best day, month, and season to publish on Towards Data Science in 2022. To conduct this analysis, I used Python to scrape Medium articles.

I found that articles received the most likes in summer, and in August specifically, and that the best day to publish an article is Friday, although the differences between days are small. I also found the top 15 writers in Towards Data Science who published the most articles in 2022, and the top 10 writers whose articles gained the most likes per article.

Additionally, I ran a Z-test to find out whether articles with the keywords “Data”, “Machine Learning”, or “Python” in the title received more likes than other articles. The Z-test suggested that articles with “Python” and “Data” in the title had more likes than the others, while “Machine Learning” showed no significant difference.

Overall, I was able to provide a comprehensive analysis of Medium articles published in Towards Data Science in 2022. 

Thanks for reading my article.

Here is my Numpy cheat sheet.

Here is the source code of the “How to be a Billionaire” data project.

Here is the source code of the “Classification Task with 6 Different Algorithms using Python” data project.

Here is the source code of the “Decision Tree in Energy Efficiency Analysis” data project.

If you still are not a member of Medium and are eager to learn by reading, here is my referral link.

“Machine learning is the last invention that humanity will ever need to make.”

Nick Bostrom
Gencay I. Machine Learning & Mechanical Engineer | Technical Content Writer | For free Cheat sheet: https://gencay.ck.page/
